Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

De Novo Genome Assembly ◾ 91

The overlap-consensus algorithm performs pairwise alignment and then it represents the

reads and the overlaps between the reads with graphs, where contiguous reads are the

vertexes (nodes) and their overlaps (common prefix or suffix) are the edges. Each node (ri)

corresponds to a read and any two reads are connected by an edge e r r

(

)

2 ^{where the suffix}

of the first read matches the prefix of the second read. The algorithm then attempts to find

the Hamiltonian path of the graph that includes the vertexes (reads) exactly once. Contigs

are then created from the consensus sequences of the overlapped suffixes and prefixes. The

downside of this method is that it requires a large sequencing coverage and the complexity

of the Hamiltonian path increases with the increase of reads. An example for an assembler

using this approach is Celera Assembler [4]. To show you how the approach of the overlap-

consensus graphs works, assume that we have the reads:

CAGAACCTAGC

ATAGCAGAACG

GCTAAGCAGTG

AGTGTCACGAC

TAGCAACTATG

AACGTAGCAGA

Figure 3.3 and 3.4 show the overlap graphs in which the reads are the nodes and the over-

lapped prefixes and suffixes are the edges. The Hamiltonian path includes each node

exactly once to form the consensus sequence.

3.1.3 De Bruijn Graphs

De Bruijn graph [5, 6] approach for de novo assembly was first introduced in 2001 for assem-

bling short reads produced by the NGS technologies and then it was refined in 2008 [7]. In

de Bruijn graphs, each read is broken into overlapping substrings of length k called over-

lapping k-mers. The k-mers then are represented by graphs in which each k-mer is a vertex

(node). Any two nodes of any two k-mers that share a common prefix and suffix of length

FIGURE 3.2 Greedy assembly.